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Abstract 

As the number of processor cores on supercomputers becomes larger and larger, 
algorithms with high degree of parallelism attract more attention. In this work, we 
propose a novel space-time coupled algorithm for solving an inverse problem associ¬ 
ated with the time-dependent convection-diffusion equation in three dimensions. We 
introduce a mixed finite element/finite difference method and a one-level and a two- 
level space-time parallel domain decomposition preconditioner for the Karush-Kuhn- 
Tucker (KKT) system induced from reformulating the inverse problem as an output 
least-squares optimization problem in the space-time domain. The new full space ap¬ 
proach eliminates the sequential steps of the optimization outer loop and the inner 
forward and backward time marching processes, thus achieves high degree of paral¬ 
lelism. Numerical experiments validate that this approach is effective and robust for 
recovering unsteady moving sources. We report strong scalability results obtained on 
a supercomputer with more than 1,000 processors. 
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1 Introduction 


In this paper, we consider an inverse problem associated with the time-dependent convection- 
diffusion equation defined in G R^: 


dC 

— = V ■ (a(x)VC) - V ■ (v(x)C) + f(x,t), 

C{x,t) = p{x,t), xGTi 
dC 

a(x)-^ = X G r2 

(^(x, 0) = C'o(x), xGtl, 


0 < t < T, xGtl 


( 1 ) 


where /(x, t) is the source term to be recovered, a(x) and v(x) are the given diffusivity and 
convective coefficients. Ti and r 2 are two disjoint parts of the boundary dQ. Dirichlet 
and Neumann boundary conditions are imposed respectively on Fi and r 2 . When the 
observation data C{x, t) is available at certain locations, several classes of inverse problems 
associated with the convection-diffusion equation ([T|) have been investigated, such as the 
recovery of the diffusivity coefficient with applications in, for examples, laminar wavy film 
flows [2D], and flows in porous media m, the recovery of the source with applications in, 
for examples, convective heat transfer problems [25], indoor airborne pollutant tracking 
[23] . ground water contamination modeling [291 [32[ jSS][39] , efc. 

The main focus of this work is to study the following inverse problem: given the mea¬ 
surement data C^(x,t) of C(x,t) at some locations inside Q for the period 0 < f < T (e 
denotes the noise level), we try to recover the time-varying source locations and intensities, 
i.e., the source function /(x, t) in equation ([T]). In the last decades, several types of numer¬ 
ical methods have been developed for retracing the sources, such as the explicit method 
|19j . the quasi-reversibility method [33], the statistical method [35] and the Tikhonov 
optimization method [i da EHi [36]. Among these methods, the Tikhonov optimization 
method is the most popular one, which reformulates the original inverse source problem 
into an output least-squares optimization problem with PDE-constraints, and by including 
appropriate regularizations it ensures the stability of the resulting optimization problem 
[13 [37]. Various techniques are available for solving the induced first-order optimality 
system [3 HE] 0122]. 
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We define the following objective functional with Tikhonov optimization; 

= [ [ A{x){C{x,t)-C^{x,t))^dxdt + Np{f), (2) 

^ Jo Jn 

where ^(x) is the data range indicator function, namely ^(x) = Yli=i ~ xi, 

X 2 , • • •, x^ are a set of specified locations, where the concentration C{x,t) is measured 
and denoted by C^(xi,t). The term Np[f) in ([2]) is called the regularization with respect 
to the source. Since f{x,t) depends on both space and time, we propose the following 
space-time regularization: 

^/3(/) = ^ [ [ l/(x,t)pdxdt-b ^ [ [ \Vf\‘^dxdt. (3) 

Jo Jn ^ Jo Jn 

Here /3i and (32 are two regularization parameters. Other regularizations, such as 
may be used, but we will show later by numerical experiments that regularization 

may offer better numerical reconstructions. 

Traditionally the problem (H])-® is solved by a reduced space sequential quadratic 
programming(SQP) method |12l I36j . which can be described as follows: 

Optimization loop (sequential) 

Step 1: Solve a forward-in-time state equation 

Loop in time (sequential) 

- Solve the steady-state equation for this time step (parallel) 

End loop 

Step 2; Solve a backward-in-time adjoint equation 
Loop in time (sequential) 

- Solve the steady-state equation for this time step (parallel) 

End loop 

Step 3: Solve objective equation (parallel) 

End loop 
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Parallelization strategies for solving the problem with reduced space SQP methods 
include by keeping the sequential steps of the outer loop and applying parallel-in-space 
algorithms such as domain decomposition, multigrid methods to the subsystems at each 
time step [2]. Reduced space SQP methods for unsteady inverse problems needs repeatedly 
solving the state equation, the adjoint equation and the objective equation, thus it divides 
the problem into many subproblems, the memory cost is low. However reduced space 
SQP methods sometimes are quite time-consuming to achieve convergence. Because of 
the sequential steps in the optimization loop and in the forward and backward time¬ 
marching processes, it is less ideal for parallel computers with a large number of processor 
cores compared to full space SQP methods. Full space method in pin] has been studied 
for steady state problems, for unsteady problems, it needs to eliminate the sequential steps 
in the outer loop and solve the full space-time problem as a coupled system. Because of 
the much larger size of the system, the full space approach may not be suitable for small 
computer systems, but it has fewer sequential steps and thus offers a much higher degree 
of parallelism required by large scale supercomputers. 

Finding suitable parallelization strategies for the optimization loop and the inner time 
loops is an active research area. An unsteady PDF-constrained optimization problem 
was solved in [38] for the boundary control of unsteady incompressible flows by solving 
a subproblem at each time step. It has the sequential time-marching process and each 
subproblem is steady-state. The parareal algorithms were studied in [3 El EH], which 
involve a coarse (coarse mesh in the time dimension) solver for prediction and a fine (fine 
mesh in the time dimension) solver for correction. Parallel implicit time integrator method 
(PITA), space-time multigrid, multiple shooting methods can be categorized as improved 
versions of the parareal algorithm [IlKIT]. The parareal algorithm combined with domain 
decomposition method [2^ or multigrid method can be powerful. So far, most references 
on parareal related studies focus mainly on the stability and convergence [ 13 - 

In this paper, we propose a fully implicit, mixed finite element and finite difference 
discretization scheme for the continuous KKT system, and a corresponding space-time 
overlapping Schwarz preconditioned solver for the unsteady inverse source identification 
problem in three dimensions. The method removes all the sequential inner time steps and 
achieves fnll parallelization in both space and time. We study the most general form of 
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the source function, in other words, we reconstruct the time history of the distribution 
and intensity profile of the source simultaneously. Furthermore, to resolve the dilemma 
that the number of linear iterations of one-level methods increases with the number of 
processors, we develop a two-level space-time hybrid Schwarz preconditioner which offers 
better performance in terms of the number of iterations and the total compute time. 

The rest of the paper is arranged as follows. In Section[2l we describe the mathematical 
formulation of the inverse problem and the derivation of the KKT system. We propose, in 
Section [3l the main algorithm of the paper, and discuss several technical issues involved 
in the fully implicit discretization scheme and the one- and two-level overlapping Schwarz 
methods for solving the KKT system. Numerical experiments for the recovery of 3D 
sources are given in Section 01 and some concluding remarks are provided in Section [5l 

2 Strong formulation of KKT system 

We formally write ([T]) as an operator equation L{C, f) = 0. For G G the following 

Lagrange functional mm transforms the PDE-constrained optimization problem ([2]) into 
an unconstrained minimization problem. Let 

J{CJ,G) = l r [ ^(x)(C(x,t) -C'^(x,t))2dxdt 

^ Jo Jn ( 4 ) 

+ Np{f) + iG,L{C,f)), 

where G is a Lagrange multiplier or adjoint variable, and {G, L{G, f)) denotes their in¬ 
ner product. Two approaches are available for solving (0|), the optimize-then-discretize 
approach and the discretize-then-optimize approach. The first approach derives a contin¬ 
uous optimality condition system and then applies certain discretization scheme, such as a 
finite element method to obtain a discrete system ready for computation. The second ap¬ 
proach discretizes the optimization function J^ and then the objective functional becomes 
a finite dimensional quadratic polynomial. The solution algorithm is then based on the 
polynomial system. The two approaches perform the approximation and discretization 
at different stages, both have been applied successfully [28], we use the optimize-then- 
discretize approach in this paper. 

The first-order optimality conditions for (0]), i.e., the KKT system, is obtained by 
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taking the variations with respect to G, C and / as 


/ 

Jg{CJ,G)v = 0 

<Jc{GJ,G)w = 0 (5) 

JfiG,f,G)g = 0 

\ 

for all v,w ^ T ; (fi)) with zero traces on Fi and g € T ; H^{Q.)). 

Using integration by part, we obtain the strong form of the KKT system: 

_ _ V • (aVC) + V • (v(x)C) - / = 0 
BG 

< - V • (aVG) - v(x) • VG + A{x)C = A{x)G^ (6) 

G + l3i^+l32Af = 0. 

To derive the boundary, initial and terminal conditions for each variable of the equations, 
we make use of the property that ([5|) holds for arbitrary directional functions v, w and 
g. For the state equation (i.e. the first one of d^D or ([U])) it is obvious to maintain the 
same conditions given by ©• For the adjoint equation (the second one of ([5]) or ([6])), by 
multiplying the test function w G L^(0, T; (fi)) with rc(',0) = 0, a(x) £ = 0 ™ n, 
we have 


Jc{C,f,G)w= f f A{x){G{x,t) — G'^{x,t))wdxdt + f G{x,T)w{x,T)d: 
Jo Jn Jn 


BG \ 

+ V • (a(x)VG) + v(x) • VG j wdxdt 


+ 


Ti 

/7 

/o Jr2 


BG 

a(x)—-h v(x) ■ n ) wdVdt. 

an 


By the arbitrariness of w, the boundary and terminal conditions for G are derived: 


G(x,t) =0, X G Fi, t G [0,T] 

BG 

a(x)— + v(x) • n = 0, x G F 2 , t G [0, T] 
G(x, T) = 0, X G 0 . 
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Similarly for the third equation of ([5]) or Q, we can deduce 


Jf{C,f,G)g = - [ f Ggdxdt+ [ [ {fg + Vf-Vg)dxdt 

Jo Jn Jo Jn 

= - f [ Ggdyidt + {fg)\t=o,T - [ [ fg 

Jo Jn Jo Jn 


+ 


rT r Qf 


r f ^gdTdt - r [ Afgd^dt 

Jo Jan Jq Jq 

= -[ [ {G + f + Af)gd:>Qdt +{fg)\t=o,T + [ [ ^gdVdt. 

Jo Jn Jo Jan c'n 

Using the arbitrariness of g, we derive the boundary, initial and terminal conditions for /: 

df df 

— =0 for t = 0, T, X G n ; — =0 for x G did, t G [0, T] . (7) 

ot on 


3 A fully implicit and fully coupled method 

In this section, we first introduce a mixed finite element and finite difference method for 
the discretization of the continuous KKT system derived in the previous section, then we 
briefly mention the algebraic structure of the discrete system of equations. In the second 
part of the section, we introduce the one- and two-level space-time Schwarz preconditioners 
that are the most important components for the success of the overall algorithm. 


3.1 Fully-implicit space-time discretization 

In this subsection, we introduce a fully-implicit hnite element/finite difference scheme to 
discretize (jh]). To discretize the state and the adjoint equations, we use a second-order 
Crank-Nicolson finite difference scheme in time and a piece-wise linear continuous finite 
element method in space. Consider a regular triangulation of domain Q, and a time 
partition P'^ of the interval [0,r]: 0 = G < < ■ ■ ■ < = T, with G = nT,T = T/M. 

Let be the piecewise linear continuous hnite element space on T^, and be the 
subspace of with zero trace on Ti. We introduce the difference quotient and the 
averaging of a function as 


drn^) = 




, t)dt, 


with := Let vr/i be the hnite element interpolation associated with the 

space V^, then we obtain the discretizations for the state and adjoint equations by hnding 
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the sequence of approximations G such that = iThCo, Gff = 0, and 

C'^(x) = Trhp{x,G),G^{x) = 0 for X G Fi, and satisfying 

/ 

{drGl^vn) + (aVC-, Vu/,) + (V • (yCl),Vh) = + (^,^h)r2, Vu/, G 

< -{drGl.Wh) + {aVGl.Vwh) + (V • {ywn).Gl) 

= -(A(x)(C-(x,t) - C^’-(x,t)),u;;,), Mwk G 

( 8 ) 

Unlike the approximations of the forward and adjoint equations in ([5]), we shall ap¬ 
proximate the source function / differently. We know that the source function satisfies an 
elliptic equation (see the third equation in (l6|)) in the space-time domain U x (0, T). So we 
shall apply x P'^ to generate a partition of the space-time domain 0 x (0, T), and then 
apply the piecewise linear finite element method in both space (three dimensions) and 
time (one dimension), denoted by to approximate the source function /. Then the 
equation for / G can be discretized as follows: Find the sequence of for n = 0,1, 
■ ■ ■, M such that 

-{Gl,gl) + Pi{drfJi,drgl) + (32{VfJi,Vgl)=0, ^gl^WJ,. (9) 

The coupled system ([8|)-(l9]) is the so-called fully discretized KKT system. In the Appendix, 
we provide some details of the discrete structure of this KKT system. 

3.2 One- and two-level space-time Schwarz preconditioning 

Usually, the unknowns of the KKT system ([5D-([2D are ordered physical variable by physical 
variable, namely in the form 

Such ordering are used extensively in SQP methods [12]. In our all-at-once method, the 
unknowns G, G and / are ordered mesh point by mesh point and time step by time step, 
and all unknowns associated with a point stay together as a block. At each mesh point Xj, 
j = 1, ■■■, N, and time step G, n = 0, ■ ■ ■, M, the unknowns are arranged in the order 
of G^,G'j,f^. Such ordering avoids zero values on the main diagonal of the matrix and 
has better cache performance for point-block LU (or ILU) factorization based subdomain 
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solvers. More precisely, we define the solution vector 


TT _ //^O /oO fO /oO /oO .fO .rl /^1 /^1 .fl /^M rM 

^ — V^l 5 ^1?/l 5 • • • ? ^iV5 /AT, 5 Jl r • • IN')’ “ 5^1 5/1 5 

/^M fM\T 

* " ? ^A^ 5 ’>Jn) • 


then the linear system ([8|)-(l9|) is rewritten as 


FhU = h, 


( 10 ) 


where is a sparse block matrix of size (M + 1)(3A^) by (M + 1)(3A^) with the following 
block structure: 


<500 <5oi 0 

.5io 5ii 5 i2 


0 ^ 
0 


Fh = 


0 
0 
V 0 


0 


.5m-i ,M-2 Sm- 1,M-1 

0 Sm,m-^ 


Sm,m / 


where the block matrices Sij for 0 < i,j < M are of size 3N x 3N and most of its elements 
are zero matrices except the ones in the tridiagonal stripes It is 

noted that if we denote the submatrices for C, G and / of size N x N respectively by 
Sij:S^,S{j in each block Sij, the sparsity of the matrices are inconsistent, namely, Sfj 
is the densest and is the sparest. This is due to the discretization scheme we have 
used. The system (jlOp is large-scale and ill-conditioned, therefore is difficult to solve 
because the space-time coupled system is denser than the decoupled system, especially for 
three dimensional problems. We shall design the preconditioner by extending the classical 
spatial Schwarz preconditioner to include both spaial and temporal variables. Such an 
approach eliminates all sequential steps and the unknowns at all time steps are solved 
simultaneously. We use a right-preconditioned Krylov subspace method to solve (fTUIl . 


FhM-^U' = b, 


where M~^ is a space-time Schwarz preconditioner and U = M~^U'. 

Denoting the space-time domain by 0 = D x (0,T), an overlapping decomposition of 
0 is defined as follows: we divide D into Ng subdomains, Di, D 2 , • • •, then partition 
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Figure 1: Left: a sample overlapping decomposition in space-time domain 0 on a fine 
mesh. Right: the same decomposition on the coarse mesh. 

the time interval [0, T] into Nt subintervals using the partition: 0 < Ti < T 2 < • • • < 
T/Vj. By coupling all the space subdomains and time subintervals, a decomposition of 
0 is 0 = where Qij = flj x (Tj-i,Tj). For convenience, the number of 

subdomains, i.e. NgNt, is equal to the number of processors. These subdomains 0jj are 
then extended to 0T to overlap each other. The boundary of each subdomain is extended 
by an integral number of mesh cells in each dimension, and we trim the cells outside of 
0. The corresponding overlapping decomposition of 0 is 0 = U^^(U^]^0T). See the left 
figure of Figure [U for the overlapping extension. 

The matrix on each subdomain 0T = fl' x Tj), i = 1,2, ■ ■ ■, Ng, j = 1,2, ■■■ , 

Nt is the discretized version of the following system of PDFs 
' f)r< 

— = V ■ (a(x)VC) - V • (v(x)C) + /(x, t) , (x, t) G 0T 

— = -V-(a(x)VG)-v(x).VC 

< dt (11) 

+ ^(x)(C(x, t) - C^(x, t)), (x, t) € 0F 

/3i^ + /32A/ + G = 0, (x,f)G0L 
with the following boundary conditions 

C(x,t)=0, G(x,t)=0, f{^,t)=0, t€[T'_t,T'] (12) 
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along with the initial and terminal time boundary conditions 

/ 

^(x, r'_i) = 0, G(x, r'_i) = 0, /(x, r'_i) = o, x e Sbi' 

< ^ ^ ^ (13) 

C(x,r') = 0, G(x,r') = o, /(x,T') = o, xeab!'. 

One may notice from m that the homogenous Dirichlet boundary conditions are 
applied in each time interval so the solution of each subdomain problem is 

not really physical. This is one of the major differences between the space-time Schwarz 
method and the parareal algorithm [23]. The time boundary condition for each subproblem 
of the parareal algorithm is obtained by an interpolation of the coarse solution, and if the 
coarse mesh is fine enough, the solution of the subdomain problem is physical. As a result, 
the parareal algorithms can be used as a solver, but our space-time Schwarz method 
can only be used as a preconditioner. Surprisingly, as we shall see from our numerical 
experiments in Section 0] the Schwarz method is an excellent preconditioner even though 
the time boundary conditions violate the physics. 

We solve the subdomain problems the same as for the global problem (1101) . no time¬ 
marching is performed in our new algorithm, and all unknowns affiliated with each sub- 
domain are solved simultaneously. Let Mjj be the matrix generated in the same way as 
the global matrix in (fTOjl but for the subproblem (fTT]) - (fT3]) . and be an exact or 
approximate inverse of Mij. Denoting the restriction matrix from 0 to the subdomain 0L 
by Rfj, with overlapping size 5, the space-time restricted Schwarz preconditioner [TO] can 
be now formulated as 

Nt Ns 
i=i*=i 

As it is well known, any one-level domain decomposition methods are not scalable with 
the increasing number of subdomains or processors. Instead one should have multilevel 
methods in order to observe possible scalable effects mm- We now propose a two-level 
space-time additive Schwarz preconditioner. To do so, we partition D with a fine mesh 

and a coarse mesh For the time interval, we have a fine partition P'^ and a coarse 
partition P'^‘^ with t < Tc- We will adopt a nested mesh, i.e., the nodal points of the coarse 
mesh x are a subset of the nodal points of the fine mesh x P'^. In practice, the 
size of the coarse mesh should be adjusted properly to obtain the best performance. On 
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the fine level, we simply apply the previously defined one-level space-time additive Schwarz 
preconditioner; and to efficiently solve the coarse problem, a parallel coarse preconditioner 
is also necessary. Here we use the overlapping space-time additive Schwarz preconditioner 
and for simplicity divide x P'^‘^ into the same number of subdomains as on the fine level, 
using the non-overlapping decomposition 0 = When the subdomains are 

extended to overlapping ones, the overlapping size is not necessarily the same as that 
on the fine mesh. See the right figure of Figure [T] for a coarse version of the space-time 
decomposition. We denote the preconditioner for the coarse level by which is defined 
by 

Nt Ns 


M7^ = 




T W—1 dO 


j=li=l 


where 6c is the overlapping size on the coarse mesh. Here the matrix M-- \ is an approxi- 
mate inverse of Mij^c which is obtained by a discretization of (Illll - (ll3p on the coarse mesh 
on 0F. 

To combine the coarse preconditioner with the fine mesh preconditioner, we need a 
restriction operator from the fine to coarse mesh and an interpolation operator 
from the coarse to fine mesh. For our currently used nested structured mesh and linear 
finite elements, is easily obtained using a linear interpolation on the coarse mesh and 
~ note that when the coarse and fine meshes are nested, instead of using 

~ we may take to be a simple restriction, e.g., the identity one which assigns 

the values on the coarse mesh using the same values on the fine mesh. In general, the coarse 
preconditioner and the fine preconditioner can be combined additively or multiplicatively. 
According to our experiments, the following multiplicative version works well: 


/ 

y = 

= y + ^^e-leveli^ “ ^hV) , 


(14) 


where ^ corresponds to the GMRES solver right-preconditioned by ^ on the coarse 
level, and FX is the discrete KKT system (|10p on the fine level. 
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4 Numerical experiments 


In this section we present some numerical experiments to study the parallel performance 
and robustness of the newly developed algorithms. When using the one-level precondi¬ 
tioner, we use a restarted GMRES method (restarts at 50) to solve the preconditioned 
system; when using the two-level preconditioner, we use the restarted flexible GMRES 
(fGMRES) method [30] (restarts at 30), considering the fact that the overall precondi¬ 
tioner changes from iteration to iteration because of the iterative coarse solver. Although 
fGMRES needs more memory than GMRES, we have observed its number of iterations 
can be significantly reduced. The relative convergence tolerance of both GMRES and 
fGMRES is set to be 10“®. The initial guesses for both GMRES and fGMRES method 
are zero. The size of the overlap between two neighbouring subdomains, denoted by iovlp, 
is set to be 1 unless otherwise specified. The subsystem on each subdomain is solved by 
an incomplete LU factorization ILU(A;), with k being its fill-in level, and /c = 0 if not 
specified. The algorithms are implemented based on the Portable, Extensible Toolkit for 
Scientific computation (PETSc) |6] using run on a Dawning TC3600 blade server system 
at the National Supercomputing Center in Shenzhen, China with a 1.271 PFlops/s peak 
performance. 

In our computations, the settings for the model system m are taken as follows. 
The computational domain, the terminal time and the initial condition are taken to 
be D = (—2,2)^, T = 1 and C'(-,0) = 0 respectively. Let L = S = H = 2, then 
the homogeneous Dirichlet and Neumann conditions in ([T|) are respectively imposed on 
Pi = {x = (xi,X 2 ,X 3 ); |xi| = L or |x 2 | = S'} and r 2 = {x = (xi,X 2 ,X 3 ); |x 3 | = H}. 
Furthermore, the diffusivity and convective coefficients are set to be o(x) = 1.0 and 
v(x) = (1.0,1.0,1.0)^. 

In order to generate the observation data, we solve the forward convection-diffusion 
equation ([T]) on a very fine mesh with a small time step size, and the resulting approximate 
solution C{x,t) is used as the noise-free observation data. Then a random noise is added 
in the following form at the locations where the measurements are taken: 

C'^(xj,t) = C{xi,t) + erC{xi,t), i = !,■■■ ,s. 

Here r is a random function with the standard Gaussian distribution, and e is the noise 
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Figure 2: The traces of two moving sources. 


level. In our numerical experiments, e = 1% if not specified otherwise. 

The numerical tests are designed to investigate the reconstruction effects with different 
types of three-dimensional sources by the proposed one- or two-level space-time Schwarz 
method, as well as the robustness of the algorithm with respect to different noise level, 
different regularizations and different amount of measurement data. In addition, parallel 
efficiency of the proposed algorithms is also studied. 

4.1 Reconstruction of 3D sources 

We devote this subsection to test the numerical reconstruction of three representative 3D 
sources by the proposed one-level space-time method, with np = 256 processors. Each of 
the three examples are constructed with its own special difficulty. 

Example 1: two Gaussian sources. This example tests two moving Gaussian 
sources in D, namely the source / takes the form: 

/(X. t) = ^.xp( - + fa 

i=l 

with a = 2.0 and two moving centers of the sources are given by 

(xi,yi,zi) = (L sin(27rt). S'cos(27rt), R cos(47rt)) 

< (15) 

{X 2 ,y 2 -,Z 2 ) = {L - 2L\ cos(4t)|, —S + 2S| cos(4t)|, —H + 2Ht^). 

\ 

The moving traces of the sources are shown in Figure [2l 

In the first experiment, we use the mesh 40 x 40 x 40 and the time step size of 1/39 for 
the inversion process. And the measurements are taken on the mesh 14 x 14 x 14, which 
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Figure 3: Example 1: the source reconstructions at three moments t = 

10/39, 20/39,30/39 with measurements collected at the mesh 14 x 14 x 14 (bottom), com¬ 
parable with the exact source distribution (top). 

is uniformly located in hi. The regularization parameters are set to be /3i = 3.6 x 10“^ 
and /32 = 3.6 x 10“^. In Figure O the numerically reconstructed sources are compared 
with the exact one at three moments t = 10/39,20/39,30/39. We can see that the source 
locations and intensities are quite close to the true values at three chosen moments. Then 
we increase the noise level to e = 5% and e = 10%, still with the same set of parameters. 
The reconstruction results are shown in Figured! We can observe that the reconstructed 
profiles deteriorate and become oscillatory as the noise level increases. This is naturally 
expected since the ill-posedness of the inverse source problem increases with the noise 
level. 

Example 2: Four constant sources. Appropriate choices of regularizations are 
important for the inversion process. In the previous example we have used a 
Tikhonov regularization in both space and time. In this example, we intend to compare 
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Figure 4: Example 1: source reconstructions with noise level e = 5% (top) and e = 10% 
(bottom). 

the regularization with the following H^-L^ regularization 

^/3(/) = V / / \f(^,t)?dxdt + ^ f [ fdxdt. 

^ Jo Jn ^ Jo Jn 

For the comparisons, we consider the case in which four constant sources move along the 
diagonals of the cube to their far corner. The four sources are specified by 

fi{x,t) = ai for |x-Xi|<0.4, \y-yi\<0A, \z-Zi\<^A 

where ai = 04 = 2.0, 02 = as = 1.0, and the traces of the four sources are described by 

/ 

(xi, yi, zi) = (—L + 2Lt, —S + 2S't, H — 2Ht) 

( 3 ^ 2 ,2/2, 22 ) = {L - ‘JLt, S - 2St, -H + 2Ht) 

< 

( 3 ^ 3 , 2 / 3 , 23 ) = {L- ‘JLt,-S + 2St, -H + 2Ht) 

(x 4 , y 4 , 24 ) = {—L + 2Lt, S — 2St, H — 2Ht). 

Same mesh and measurements are used as in Example 1, and the regularization parameters 
are set to be /3i = 10“®,/32 = 10 “^ in A^^(/), and jdi = 10“®,/32 = 10 “® in Np{f), 


16 


















Figure 5: Example 2: the source reconstructions with H^-H^ regularization (mid) and 
H^-L^ regularization (bottom), compared with the exact solution (top). 

respectively. The reconstruction results are compared with the true solution at three 
moments t = 10/39,20/39,30/39, and two slices at x = 0.95 and x = —0.95. It is 
observed from Figure [5] that the resolution of the source profile is much better with the 
H^-H^ regularization A(a(/) than with the H^-L? regularization and the latter 

presents a reconstruction process that is much less stable and much more oscillatory. 

Example 3: Eight moving sources. This last example presents a very challenging 
case that eight Gaussian sources are initially located at the corners of the physical cubic 
domain, then move inside the cube following their own traces given below. The Gaussian 
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sources are described by 


/(x,i) = 


,-(a;-a;i)2+(j/-j/i)2+(2-Zi)2 


2=1 


where the coefficients Oj and the source traces are represented by 


oi — 02 —03 — 04 — 4.0; 05 — og — 07 — og — 6.0, 

(xi, yi, zi) = (-L + 2L(1 - t ),-S + 2,S(1 - t), -H + 2F(1 - t)) 

{X2,y2, Z 2 ) = {-L + 2Lt, -S + 2St, -H + 2Ht) 

(x 3 , 7 / 3 , 23 ) = (—L + 2L cos( 7 rt)^(l — t), —5 + 25* sin( 7 rt)^t, —H + 2H cos( 7 rt)^(l — t)) 

(x 4 , ?/ 4 , 24 ) = (—L + 2L cos( 7 rt)^(l — t), —S + 25 cos( 7 rt)^(l — t), —H + 2H sin( 7 rt)^t)) 

(xg, ?/ 5 , 2 : 5 ) = (—L + 2L cos(27rt)^ cos /2t) , —5 + 25 sin( 7 r sin ( 7 r/ 2 t), 

—2/ + 2H sin( 7 rt)^ sin ( 7 r/ 2 t)) 

{xQ,yQ, zq) = (—L + 2L sin( 7 rt)^ sin ( 7 r/ 2 t), —5 + 25 cos(27r cos ( 7 r/ 2 t), 

—if + 2H sin( 7 rt)^ sin ( 7 r/ 2 t)) 

(x 7 , ?/ 7 , 27 ) = (—L + 2L sin( 7 rt)^ sin ( 7 r/ 2 t), —5 + 25 sin( 7 r sin ( 7 r/ 2 t), 

—ii + 2H cos(27rt)^ cos ( 7 r/ 2 t)) 

(xg, ys, + 2L sin( 7 rt)^ sin ( 7 r/ 2 t), —5 + 25 cos(27r cos ( 7 r/ 2 t), 

—ii + 2H cos(27ri)^ cos ( 7 r/ 2 i)) . 

We shall use the mesh 64 x 64 x 64 and the time step size 1/47, with two regularization 
parameters /3i = 3.6 x 10“® and (32 = 3.6 x 10“^. We compare the results recovered by two 
sets of measurements, collected at two meshes 21 x 21 x 21 and 9x9x9 respectively, which 
are both uniformly distributed in 11, with the exact solution shown in Figure [ 6 ] (top), at 
three time moments t = 0.0,10/47,1.0. Clearly better reconstructions are observed for 
the case with more measurements collected at the finer mesh 21 x 21 x 21 , though the 
coarser mesh 9 x 9 x 9 is good enough for locating the sources, only with their recovered 
source intensities smaller than the true values. 


4.2 Performance in parallel efficiency 

In the previous subsection, we have shown with 3 representative examples that the pro¬ 
posed algorithm can successfully recover the intensities and distributions of unsteady 
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Figure 6: Example 3: the source reconstructions with measurements collected at the 
mesh 21 X 21 X 21 (mid) and 9x9x9 (bottom), compared with the exact solution (top) 
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Table 1: Effects of ILU fill-in levels on the two-level method for Example 1 (columns 2-3), 
Example 2 (columns 4-5), and Example 3 (columns 6-7). 


ILU(A:) 

Its 

Time (sec) 

Its 

Time (sec) 

Its 

Time (sec) 

0 

47 

10.498 

55 

12.448 

81 

17.238 

1 

28 

33.633 

36 

47.766 

60 

49.622 

2 

18 

230.552 

23 

232.914 

48 

257.798 

3 

15 

1121.469 

20 

1132.841 

45 

1165.203 


sources and is robust with respect to the noise in the data, the choice of Tikhonov regular¬ 
izations and the number of measurements. These numerical simulations are all computed 
using the proposed one-level space-time method with np = 256 processors. In this section, 
we focus on our proposed two-level space-time method and study its parallel efficiency 
with respect to the number of ILU fill-in levels, namely the number k in ILU(/c), and the 
overlap size iovlp. We also compare the number of iterations and the total compute time 
of the one-level and two-level methods with increasing degrees of freedoms (DOFs) and 
the number of processors. 

First we will test how the number of fGMRES iterations and the total compute time 
of the two-level method change with different ILU fill-in levels. We use the coarse mesh 
21 X 21 X 21 with the time step 1/20, and the fine mesh 41 x 41 x 41 with the time step 
1/40 for Example 1, 2, and 3, and the overlap size iovlp = 1. We see that the total number 
of degrees of freedom on the fine mesh is 16 times of the one on the coarse mesh. Table 
[T] shows the comparison with np = 256 processors. Column 2-3, 4-5 and 6-7 present the 
results for Example 1, 2 and 3 respectively. It is observed that as the fill-in level increases 
the number of fGMRES iterations decreases, but the total compute time increases. When 
the fill-in level increases to 3, the compute time increases significantly and the number of 
iterations only reduces by 3 times. This suggests a suitable fill-in level to be ilulevel = 0 
or 1. 

Next we look at the impact of the overlap size. We still use the same fine and coarse 
meshes for all examples, and ILU(O) for the solver for each subdomain problem on both 
the coarse and fine meshes. The overlap size on the coarse mesh is set to be 1. We test 
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Table 2: Effects of the overlap size on the two-level method for Example 1 (columns 2-3), 
Example 2 (columns 4-5), and Example 3 (columns 6-7). 


iovlp 

Its 

Time (sec) 

Its 

Time (sec) 

Its 

Time (sec) 

1 

47 

10.498 

55 

12.448 

81 

17.238 

2 

39 

13.071 

51 

23.663 

69 

27.952 

4 

37 

27.423 

49 

45.225 

68 

47.032 


different overlap sizes on the fine level, and the results are given in Tabled It is observed 
that when the overlap size increases from 1 to 2 and then to 4, the number of fGMRES 
iterations decreases slowly and the total compute time increases. So we shall mostly use 
iovlp = 1 in our subsequent computations. 

Lastly we compare the performance of the one-level and two-level space-time Schwarz 
preconditioners in Tables [3] and 01 On the coarse level, a restarted GMRES is used, with the 
one-level space-time Schwarz preconditioner. ILU(O) is used as the local preconditioner 
on each subdomain and the coarse overlap size is set to be 1. A tighter convergence 
tolerance on the coarse mesh can reduce the number of outer fGMRES iterations, but 
often increases the total compute time. In the following numerical examples, we set the 
tolerance to be 10“^ and the maximum number of GMRES iterations to 4 on the coarse 
mesh. Moreover, the mesh size of the coarse mesh is also an important factor for the 
performance. If the mesh is too coarse, both the number of outer iterations and the total 
compute time increase; on the other hand, if the mesh is not coarse enough, too much time 
is spent for the coarse solver, the number of outer iterations may decrease significantly, 
but the compute time may increase. 

In the following experiments for Example 1, 2 and 3, we use three sets of fine meshes, 
33 X 33 X 33, 49 x 49 x 49 and 67 x 67 x 67, and the corresponding time steps are 1/32, 1/48 
and 1/66 respectively, while the coarse meshes are chosen to be 17 x 17 x 17, 17 x 17 x 17 
and 23 X 23 X 23, with the corresponding time steps being 1/16, 1/48 and 1/66. So the 
DOFs on the fine meshes are 16, 27 and 27 times of the ones on the coarse meshes for 
Example 1, 2 and 3 respectively. We use np = 64,128 and 512 processors for the three sets 
of meshes respectively and compare their performance with the one-level method in Table 
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[3j Savings in terms of the number of iterations and the total compute time are obtained 
for the two-level method with all three sets of meshes. As we observe that the number of 
iterations of the two-level method is mostly reduced by at least 4 times compared to the 
one for the one-level method, but the compute time is usually reduced by 2 to 4 times. 

Next we fix the space mesh to be 49 x 49 x 49 and the time step to be 1/48, resulting in 
a very large-scale discrete system with 17,294,403 DOFs. For the two-level method, we set 
the coarse mesh to be 17 x 17 x 17 with the time step 1/48, which implies that the DOFs 
on the fine mesh is about 27 times of the ones on the coarse mesh. Then the problem is 
solved with np = 128, 256,512, and 1024 processors respectively. The performance results 
of the one-level and two-level methods are presented in Table 01 We observe that when 
the number of sources is small, both the one-level and two-level methods are scalable 
with up to 512 processors, but the two-level method takes much less compute time. The 
strong scalability deteriorates when the number of processors is too large for the size of 
the problems. As the number of sources increases, the scalability becomes slightly worse 
for both one-level and two-level methods, even though the two-level method is still faster 
in terms of the total compute time. 

5 Concluding remarks 

In this work we have proposed and studied a new fully implicit, space-time coupled, mixed 
finite element and hnite difference discretization method, and a parallel one- and two- 
level domain decomposition solver for the three-dimensional unsteady inverse convection- 
diffusion problem. With a suitable number of measurements, this all-at-once approach 
provides acceptable reconstruction of the physical sources in space and time simultane¬ 
ously. The classical overlapping Schwarz preconditioner is extended successfully to the 
coupled space-time problem with a homogenous Dirichlet boundary condition applied on 
both the spatial and temporal part of the space-time subdomain boundaries. The one- 
level method is easier to implement, but the two-level hybrid space-time Schwarz method 
performs much better in terms of the number of iterations and the total compute time. 
Good scalability results were obtained for problems with more than 17 millions degrees of 
freedom on a supercomputer with more than 1,000 processors. The approach is promising 
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Table 3; Comparisons between the one-level and two-level space-time preconditioners for 
Examples 1-3 with different meshes. 


Exl 

np 

Mesh 

M 

level 

Its 

Time (sec) 

64 

33 X 33 X 33 

33 

1 

175 

53.635 




2 

57 

20.653 

128 

49 X 49 X 49 

49 

1 

346 

200.664 




2 

83 

47.812 

512 

67 X 67 X 67 

67 

1 

491 

675.985 




2 

105 

212.72 

Ex2 

np 

Mesh 

M 

level 

Its 

Time (sec) 

64 

33 X 33 X 33 

33 

1 

228 

72.338 




2 

77 

20.246 

128 

49 X 49 X 49 

49 

1 

365 

214.058 




2 

85 

47.078 

512 

67 X 67 X 67 

67 

1 

599 

841.652 




2 

121 

216.92 

Ex3 

np 

Mesh 

M 

level 

Its 

Time (sec) 

64 

33 X 33 X 33 

33 

1 

297 

82.834 




2 

76 

21.738 

128 

49 X 49 X 49 

49 

1 

405 

238.712 




2 

93 

57.244 

512 

67 X 67 X 67 

67 

1 

716 

872.766 




2 

137 

263.222 
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Table 4; Comparisons between the one-level and two-level space-time preconditioners for 
Examples 1-3 with different number of processors. 




Exl 


Ex2 


Ex3 


np 

level Its 

Time (sec) 

) Its 

Time (sec) 

1 Its 

Time (sec) 

128 

1 

346 

200.664 

365 

214.815 

405 

238.712 


2 

83 

47.812 

85 

47.072 

93 

57.244 

256 

1 

343 

127.035 

363 

152.334 

408 

145.213 


2 

82 

24.744 

87 

26.424 

90 

36.307 

512 

1 

343 

69.482 

363 

95.707 

400 

101.343 


2 

82 

16.461 

101 

19.453 

100 

18.611 

1024 

1 

351 

41.821 

393 

58.785 

433 

59.534 


2 

85 

10.132 

100 

11.352 

104 

15.815 


to more general unsteady inverse problems in large-scale applications. 
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A The discrete structure of the KKT system 

The KKT system ([8])-([9|) is formulated as follows: 

{drCJi,vn) + (aVC-, Vu;,) + (V • (vCJi),Vh) = {fj:,Vh) + {it,Vh)r„ 
-{drGl,Wh) + {aVGl,Vwh) + (V • {-^Wh),Gl) 

= -(^(x)(C'^(x,f) - C'^’”(x,t)),u;,j), ywh € 

-{Gl gl) + drgl) + Vgl) =0, G . 


(16) 


To better understand the discrete structure of (fTUD . we denote the identity and zero ma¬ 
trices as I and 0 respectively, and the basis functions of the finite element spaces and 


■ ■ ■, N and 

g^, 4 = 1, • • •, 

N,n = 

0, 

, M, respectively. 
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and based on these element matrices we define 


Ai = B + ^{A + E), A2 = -B + ^{A + E) 

B^ = B + ^-{A + E^), B2 = -B + ^{A + E^) 
i?3 = zeros except 1 at the measurement locations 


+ /32K" 


Then the system (11611 takes the following form 

( CO \ 

Cl 

(jM-2 
(jM-1 
CM 

CO 
Cl 
C2 


BC BG Bf 


CM—2 

qM-I 

CM 

f 


f 


M-2 


cM-l 


f 

V / 


/ 


CO 


{q A) r 2 

{q^A)T2 

t/2S3(C^’0 + C^’1) 

r/2S3(C^'ii^-2 + C"’i^-i) 
t/2B3{C^’^-^ + C^’^) 

CM 

0 
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where the block matrices BC,BG and Bf are given by 
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0 0 
0 0 
0 0 
0 0 


Bi B 2 


BG := 


V 


Bf := 
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0 0 

0 0 

0 0 

-1)01 
-Z)io _£)ii 
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0 0 

^ 0 0 

-iB -iB 

0 

0 0 

0 0 

0 0 

0 0 

0 0 

0 0 
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P^oo ^r0l 


0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
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0 

0 

0 

0 

0 
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0 

0 

0 

0 

0 

0 
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Bi 
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0 

0 

0 

0 

0 

0 

0 

0 
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0 

0 

0 

0 

0 
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0 

0 

0 

0 

0 

0 

0 
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Bi 

0 

0 

0 

0 
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0 
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0 

0 
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